We aim to develop a method to align and compare topics when:
(1) the number of topics is changed (varying \(K\));
(2) the hyper-parameters of LDA are changed (e.g. varying \(\alpha\));
(3) different modalities [? is there a better term than “modality” ?] exist for the same documents. For example, the same set of documents exists in different languages but we don’t have a direct translation of each word. Or, more commonly in biology, the same samples have been profiled for different -omics data, e.g. metagenomic or transcriptomic, and we wish to compare the topics from these different domains;
(4) different datasets composed of different documents exist with the same modality, i.e. the same words are present in both datasets. For example, one dataset is the collection of articles from The New York Times, the other is the collection of articles from The Guardian, and we wish to compare the topics identified in these two datasets. In microbiome biology, we may want to compare topics obtained from two different cohorts. For example, the vaginal microbiome composition of pregnant (cohort 1) and non-pregnant (cohort 2) women has been measured by 16S rRNA sequencing and we want to compare the communities identified in these two cohorts.
Models
We aim to compare the topics of \(M\) LDA models. Each specific model is denoted by \(m \in [1:M]\).
Topics
Each model \(m\) has \(K\) topics. Each topic is denoted by \(k \in [1:K]\).
Documents (Samples)
The dataset is composed of \(D\) documents (or samples). Each document/sample is denoted by \(d \in [1:D]\).
Words (features)
The dataset contains counts for a set of \(W\) words (or features). In biology, these features would be genes, transcripts, proteins, bacterial species, etc. Each word is denoted by an index \(w \in [1:W]\). The number of occurrences of word \(w\) in a specific document \(d\) is denoted by \(c_{w,d}\).
LDA model matrices
LDA models are defined by two matrices:
\(\beta\), which is a \(K \times W\) matrix where element \(\beta_{k,w}\) provides the proportion of word \(w\) in topic \(k\), and
\(\gamma\), which is a \(D \times K\) matrix where element \(\gamma_{d,k}\) provides the proportion of topic \(k\) in document \(d\).
In other words, an LDA model finds topics such that each document is optimally described as a mixture of topics (\(\gamma\)), each topic being itself characterized by a probability distribution over words (\(\beta\)).
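As a minimal illustration of these two matrices, the sketch below builds a toy \(\beta\) and \(\gamma\) by hand. The values and the names `beta`, `gamma` and `expected_freq` are made up for this example and are not output of the functions described in this document.

```r
# Toy LDA output: K = 2 topics, W = 3 words, D = 4 documents (made-up values).
beta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(topic = c("a", "b"), word = c("w1", "w2", "w3"))
)
gamma <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8,
    0.0, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(document = paste0("d", 1:4), topic = c("a", "b"))
)

# Each topic is a distribution over words (rows of beta sum to 1) and
# each document is a distribution over topics (rows of gamma sum to 1).
stopifnot(isTRUE(all.equal(unname(rowSums(beta)), rep(1, 2))))
stopifnot(isTRUE(all.equal(unname(rowSums(gamma)), rep(1, 4))))

# Word frequencies implied by the model for each document: a D x W matrix.
expected_freq <- gamma %*% beta
```

Each row of `expected_freq` is again a probability distribution over words, since it is a mixture of the topic distributions weighted by the topic proportions of the document.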
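For objectives (1) and (2), we can align topics using either the \(\beta\) or the \(\gamma\) matrices from each model \(m\). For objective (3), the words differ between modalities but the documents are shared, so only the \(\gamma\) matrices can be used to align topics. For objective (4), the documents differ between datasets but the words are shared, so only the \(\beta\) matrices can be used.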
We will first consider the problem of aligning topics using the \(\gamma\) matrices, then consider the twin problem of aligning topics using the \(\beta\) matrices and discuss similarities and differences.
In both cases, in addition to aligning topics between successive models (e.g. successive values of \(K\) or \(\alpha\), or manually ordered modalities or samples), we are also interested in computing and visualizing the alignment across all models.
The alignment based on the \(\gamma\) matrices is a simpler problem than the alignment based on the \(\beta\) matrices. Each element \(\gamma_{d,k}\) can be interpreted as the share of the mass of document \(d\) that is assigned to topic \(k\), so we can assign a mass to each document and compute how this mass is transferred between the topics of a first model and the topics of a second model.
To compute the alignment based on the \(\gamma\) matrices, we simply compute the proportion of mass transferred between each topic of successive models as \(w^{\gamma}_{k^m, k^{m+1}} = \frac{1}{D} \sum_{d}^D \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\).
Consequently, the “height” (or total mass) of each topic \(h_{k^m}\) is \(h_{k^m} = \sum_d \gamma_{d,k^m}\). Topics that are the main topic of many documents have a larger “height” than topics that are secondary topics of many documents or the main topic of few documents.
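The weights and heights defined above can be computed with a single matrix product. The following is a minimal sketch on two made-up \(\gamma\) matrices; the object names (`gamma_m`, `gamma_next`, `w`, `h`) are chosen for this example and do not reflect how `align_topics` is implemented.

```r
# Toy gamma matrices for two successive models fitted on the same D = 4
# documents (made-up values): model m has 2 topics, model m + 1 has 3 topics.
gamma_m <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8,
    0.0, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(NULL, c("a", "b"))
)
gamma_next <- matrix(
  c(0.8, 0.1, 0.1,
    0.3, 0.5, 0.2,
    0.1, 0.2, 0.7,
    0.0, 0.1, 0.9),
  nrow = 4, byrow = TRUE, dimnames = list(NULL, c("a", "b", "c"))
)
D <- nrow(gamma_m)

# w[k, k_next] = (1 / D) * sum_d gamma_m[d, k] * gamma_next[d, k_next]
w <- crossprod(gamma_m, gamma_next) / D

# Topic heights of model m: total document mass assigned to each topic.
h <- colSums(gamma_m)

# Each row of gamma_next sums to 1, so the weights leaving a topic of
# model m add up to its height divided by D.
stopifnot(isTRUE(all.equal(rowSums(w), h / D)))
```

With this normalization the weights over all pairs of topics sum to 1, so each weight can be read as a fraction of the total document mass.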
If we wish to split these weights by the topics of a reference model \(m^R\), we can assign each document \(d\) to a reference topic \(k^R_d\). This reference topic is defined as the topic of the reference model with the largest proportion for this document: \(k^R_d = \arg \max_{k^R} \gamma_{d,k^R}\).
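Then, the alignment becomes \(w^{\gamma}_{k^m, k^{m+1}, k^R} = \frac{1}{D} \sum_{d \in D^{k^R}} \gamma_{d,k^m} \ \gamma_{d, k^{m+1}}\), where \(D^{k^R}\) is the set of documents assigned to reference topic \(k^R\).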
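The split by reference topic can be sketched as follows, again on made-up \(\gamma\) matrices. The last model is used as the reference here, and the weights are normalized by the total number of documents \(D\) (an assumption of this sketch) so that the split weights add back up to the unsplit ones. All object names are illustrative.

```r
# Toy gamma matrices over the same D = 4 documents (made-up values).
gamma_m <- matrix(
  c(0.9, 0.1,
    0.5, 0.5,
    0.2, 0.8,
    0.0, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(NULL, c("a", "b"))
)
gamma_next <- matrix(
  c(0.8, 0.1, 0.1,
    0.3, 0.5, 0.2,
    0.1, 0.2, 0.7,
    0.0, 0.1, 0.9),
  nrow = 4, byrow = TRUE, dimnames = list(NULL, c("a", "b", "c"))
)
gamma_ref <- gamma_next  # the last model serves as the reference
D <- nrow(gamma_m)

# Reference topic of each document: the reference topic with the largest gamma.
ref_topic <- max.col(gamma_ref, ties.method = "first")

# One K^m x K^(m+1) weight matrix per reference topic, each computed
# only on the documents assigned to that reference topic.
w_split <- lapply(seq_len(ncol(gamma_ref)), function(k_ref) {
  docs <- ref_topic == k_ref
  crossprod(gamma_m[docs, , drop = FALSE], gamma_next[docs, , drop = FALSE]) / D
})

# Summing over the reference topics recovers the unsplit weights.
w_total <- Reduce(`+`, w_split)
stopifnot(isTRUE(all.equal(w_total, crossprod(gamma_m, gamma_next) / D)))
```

Because every document is assigned to exactly one reference topic, the split only partitions the sum over documents; it does not change the total mass flowing between two topics.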
Aligning topics based on the \(\beta\) matrices is more complex, as it requires optimizing (instead of merely computing) the mass transfer between topics. This is because there is no clear concept of mass conservation between the topics in the absence of documents.
[NOTE: what follows in this section will change - it’s probably not useful]
To align topics based on the distribution of word probability in these topics, we first define the following concepts:
the average word frequency: \(f_w = \frac{1}{D} \sum_d^D f_{w,d}\) with \(f_{w,d} = \frac{c_{w,d}}{\sum_w^W c_{w,d}}\)
the “topic height”: \(h_{k^m} = \sum_w^W f_w \ \beta_{k^m,w}\)
the “reference topic height” in each topic: \(h_{k^m, k^R} = \sum_w^{W^R} f_w \ \beta_{k^m,w}\)
[These definitions are sufficient to draw the composition of each topic for each \(m\), but to draw the flow between the topics, we need to find the optimal mass transfer]
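For illustration only, the average word frequency and the \(\beta\)-based topic height can be computed as in the sketch below. The counts and \(\beta\) values are made up, and the count matrix is assumed to be stored with documents in rows and words in columns.

```r
# Toy count matrix: D = 4 documents (rows) x W = 3 words (columns), made-up values.
counts <- matrix(
  c(10, 0, 5,
     2, 8, 0,
     1, 1, 8,
     0, 4, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d", 1:4), c("w1", "w2", "w3"))
)

# f_{w,d}: frequency of each word within each document (rows sum to 1).
f_wd <- counts / rowSums(counts)

# f_w: word frequency averaged over documents.
f_w <- colMeans(f_wd)

# Toy beta matrix: K = 2 topics (rows) x W = 3 words (columns).
beta <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("a", "b"), c("w1", "w2", "w3"))
)

# Topic height: h_k = sum_w f_w * beta[k, w]
h <- drop(beta %*% f_w)
```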
We have implemented the methods described above in a series of functions which can be run sequentially:
run_lda_models, which runs the LDA models for a specified set of \(K\)s, \(\alpha\)s, modalities or datasets. [NOTE: so far, it’s only implemented for K]
align_topics, which performs the topic alignment on the \(\gamma\) and/or the \(\beta\) matrices when possible [NOTE: only the alignment based on \(\gamma\) is implemented] and reorders the topics so that strongly aligned topics are placed close to each other [not sure how to formulate this]. If a reference model is not provided, the last model is used as the reference.
visualize_aligned_topics, which computes the visualization layout and returns a ggplot object with the alignment flow between topics.
Below is an example of how these functions are used on vaginal microbiome data.
# Libraries to attach
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(topicmodels)
library(slam)
# load the topic alignment functions
source("align_topic_functions.R")
# viz default theme
theme_set(theme_minimal())
load(file = "vm_16s_data.Rdata", verbose = TRUE)
## Loading objects:
## vm_16s
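# Build short "genus species strain" names from fields 6 to 8 of the original column names
new_asv_names = colnames(vm_16s) %>%
  str_split_fixed(., " ", n = 8) %>%
  as.matrix() %>% .[,c(6, 7, 8)] %>%
  as.data.frame() %>%
  set_colnames(c("genus","species","strain")) %>%
  mutate(short_name =
           str_c(genus, " ",
                 species %>% str_replace(.,"NA","-")," ",
                 strain)) %>%
  select(short_name) %>% unlist()
# Disambiguate duplicated names with a numeric suffix
# (seq_along() is safe when there are no duplicates, unlike 1:length(j))
j = which(duplicated(new_asv_names))
new_asv_names[j] = str_c(new_asv_names[j], " (", seq_along(j),")")
colnames(vm_16s) = new_asv_names
# Round to integer counts and convert to a sparse matrix
vm_16s <- slam::as.simple_triplet_matrix(vm_16s %>% round())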
topic_models_dir = "lda_models/"
lda_models =
  run_lda_models(
    data = vm_16s,
    Ks = 1:13,
    method = "VEM",
    seed = 2,
    dir = topic_models_dir
  )
names(lda_models)
## [1] "betas" "gammas"
head(lda_models$betas)
## # A tibble: 6 x 5
## m K k_LDA w b
## <fct> <dbl> <chr> <chr> <dbl>
## 1 1 1 a Lactobacillus iners 1 0.297
## 2 1 1 a Lactobacillus crispatus 1 0.239
## 3 1 1 a Lactobacillus iners 2 0.0353
## 4 1 1 a Lactobacillus gasseri 1 0.0448
## 5 1 1 a Megasphaera - 1 0.0346
## 6 1 1 a Lactobacillus jensenii 1 0.0392
head(lda_models$gammas)
## # A tibble: 6 x 5
## m K k_LDA d g
## <fct> <dbl> <chr> <chr> <dbl>
## 1 1 1 a 1005601068 1
## 2 1 1 a 1005601078 1
## 3 1 1 a 1005601088 1
## 4 1 1 a 1005601098 1
## 5 1 1 a 1005601108 1
## 6 1 1 a 1005601118 1
aligned_topics =
  align_topics(
    data = vm_16s, # same data as passed to run_lda_models()
    lda_models = lda_models
  )
names(aligned_topics)
## [1] "lda_models" "gamma_alignment" "topics_order"
head(aligned_topics$gamma_alignment)
## # A tibble: 6 x 10
## m m_next m_ref k_LDA k_LDA_next k_LDA_ref w k k_next k_ref
## <fct> <fct> <fct> <chr> <chr> <chr> <dbl> <int> <int> <int>
## 1 1 2 13 a a a 0.00639 1 1 3
## 2 1 2 13 a a b 0.0104 1 1 13
## 3 1 2 13 a a c 0.00110 1 1 10
## 4 1 2 13 a a d 0.0141 1 1 12
## 5 1 2 13 a a e 0.00467 1 1 4
## 6 1 2 13 a a f 0.0666 1 1 5
# head(aligned_topics$beta_alignment) # not implemented
ggplot(aligned_topics$topics_order, aes(x = m, y = k, col = k_LDA)) +
  geom_text(aes(label = k_LDA)) + guides(col = "none")
g_aligned_topics =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = TRUE,
    min_beta = 0.05,
    add_words_labels = TRUE
  )
g_aligned_topics
g_aligned_topics =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    add_leaves = FALSE
  )
g_aligned_topics
g_aligned_topics_ref =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = FALSE
  )
g_aligned_topics_ref
g_aligned_topics_ref =
  visualize_aligned_topics(
    aligned_topics = aligned_topics,
    color_by = "reference",
    add_leaves = TRUE
  )
g_aligned_topics_ref